173 research outputs found

A Stochastic Penalty Model for Convex and Nonconvex Optimization with Big Constraints

Authors: Mishchenko Konstantin; Richtárik Peter
Publication date: 31/10/2018

The last decade witnessed a rise in the importance of supervised learning applications involving big data and big models. Big data refers to situations where the amount of training data available and needed causes difficulties in the training phase of the pipeline. Big model refers to situations where high-dimensional and over-parameterized models are needed for the application at hand. Both of these phenomena have led to a dramatic increase in research activity aimed at taming the issues via the design of new sophisticated optimization algorithms. In this paper we turn attention to the big constraints scenario and argue that elaborate machine learning systems of the future will necessarily need to account for a large number of real-world constraints, which will need to be incorporated in the training process. This line of work is largely unexplored, and provides ample opportunities for future work and applications. To handle the big constraints regime, we propose a stochastic penalty formulation which reduces the problem to the well understood big data regime. Our formulation has many interesting properties which relate it to the original problem in various ways, with mathematical guarantees. We give a number of results specialized to nonconvex loss functions, smooth convex functions, strongly convex functions and convex constraints. We show through experiments that our approach can beat competing approaches by several orders of magnitude when a medium accuracy solution is required.
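
The reduction to the big data regime can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes the penalty is the average of half squared distances to the constraint sets, so that plain SGD can sample one constraint per step, much as it would sample one data point. The halfspace constraints, the quadratic objective, the penalty weight `lam` and the step size are all hypothetical choices made for illustration.

```python
import random

def project_halfspace(x, a, b):
    """Euclidean projection of x onto the halfspace {z : <a, z> <= b}."""
    violation = sum(ai * xi for ai, xi in zip(a, x)) - b
    if violation <= 0.0:
        return list(x)
    norm_sq = sum(ai * ai for ai in a)
    return [xi - violation / norm_sq * ai for ai, xi in zip(a, x)]

def stochastic_penalty_sgd(target, constraints, lam=50.0, step=0.002,
                           iters=20000, seed=0):
    """SGD on 0.5*||x - target||^2 + lam * mean_j 0.5*dist(x, C_j)^2."""
    rng = random.Random(seed)
    x = [0.0] * len(target)
    for _ in range(iters):
        a, b = rng.choice(constraints)       # sample one constraint
        proj = project_halfspace(x, a, b)
        # The gradient of 0.5*dist(x, C)^2 is x - proj_C(x).
        grad = [(xi - ti) + lam * (xi - pi)
                for xi, ti, pi in zip(x, target, proj)]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

# Hypothetical constraints: x1 <= 1, x2 <= 1, x1 + x2 <= 1.5.
constraints = [([1.0, 0.0], 1.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 1.5)]
x_pen = stochastic_penalty_sgd(target=[2.0, 2.0], constraints=constraints)
```

With a finite penalty weight the iterates settle near, but slightly outside, the feasible set; a larger `lam` tightens the approximation at the price of a smaller admissible step size.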

arXiv.org e-Print Archive

On Optimal Probabilities in Stochastic Coordinate Descent Methods

Authors: Richtárik Peter; Takáč Martin
Publication date: 12/10/2013

We propose and analyze a new parallel coordinate descent method, 'NSync, in which at each iteration a random subset of coordinates is updated, in parallel, allowing for the subsets to be chosen non-uniformly. We derive convergence rates under a strong convexity assumption, and comment on how to assign probabilities to the sets to optimize the bound. In both complexity and practical performance, the method can outperform its uniform variant by an order of magnitude. Surprisingly, the strategy of updating a single randomly selected coordinate per iteration, with optimal probabilities, may require fewer iterations, both in theory and practice, than the strategy of updating all coordinates at every iteration.

Comment: 5 pages, 1 algorithm ('NSync), 2 theorems, 2 figures
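
A minimal sketch of the single-coordinate case mentioned at the end of the abstract, under simplifying assumptions: the objective is a strongly convex quadratic, one coordinate is sampled per iteration, and the step along coordinate i is 1/A[i][i]. The matrix, the right-hand side and the choice of probabilities proportional to the diagonal are hypothetical; the general method samples subsets of coordinates and is not reproduced here.

```python
import random

def coordinate_descent(A, b, probs, iters=5000, seed=0):
    """Minimize 0.5*x'Ax - b'x, updating one randomly chosen coordinate per step."""
    rng = random.Random(seed)
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # Non-uniform sampling: coordinate i is drawn with weight probs[i].
        i = rng.choices(range(n), weights=probs)[0]
        grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
        x[i] -= grad_i / A[i][i]   # exact minimization along coordinate i
    return x

def residual_norm(A, b, x):
    """Euclidean norm of Ax - b, used to check convergence."""
    n = len(b)
    return sum((sum(A[i][j] * x[j] for j in range(n)) - b[i]) ** 2
               for i in range(n)) ** 0.5

# Hypothetical symmetric positive definite system with uneven diagonal.
A = [[100.0, 1.0, 0.0], [1.0, 10.0, 1.0], [0.0, 1.0, 1.0]]
b = [1.0, 2.0, 3.0]
diagonal = [A[i][i] for i in range(3)]

x_nonuniform = coordinate_descent(A, b, probs=diagonal)
x_uniform = coordinate_descent(A, b, probs=[1.0, 1.0, 1.0])
```

Both runs solve the system; which sampling needs fewer iterations depends on the problem, and comparing `residual_norm` across iteration budgets is one way to explore that.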

arXiv.org e-Print Archive

Linearly convergent stochastic heavy ball method for minimizing generalization error

Authors: Loizou Nicolas; Richtárik Peter
Publication date: 22/12/2017

In this work we establish the first linear convergence result for the stochastic heavy ball method. The method performs SGD steps with a fixed stepsize, amended by a heavy ball momentum term. In the analysis, we focus on minimizing the expected loss and not on finite-sum minimization, which is typically a much harder problem. While in the analysis we constrain ourselves to quadratic loss, the overall objective is not necessarily strongly convex.

Comment: NIPS 2017, Workshop on Optimization for Machine Learning (camera ready version)
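
The update described above (an SGD step with fixed stepsize plus a momentum term) can be sketched as follows. This is an illustration under assumptions, not the paper's exact setting: the quadratic losses are normalized squared residuals of a small consistent linear system, and the stepsize, momentum value and data are hypothetical.

```python
import random

def stochastic_heavy_ball(rows, rhs, stepsize=1.0, momentum=0.3,
                          iters=3000, seed=0):
    """x_next = x - stepsize * grad f_i(x) + momentum * (x - x_prev)."""
    rng = random.Random(seed)
    dim = len(rows[0])
    x_prev = [0.0] * dim
    x = [0.0] * dim
    for _ in range(iters):
        i = rng.randrange(len(rows))
        a = rows[i]
        resid = sum(aj * xj for aj, xj in zip(a, x)) - rhs[i]
        norm_sq = sum(aj * aj for aj in a)
        # Gradient of f_i(x) = 0.5 * (a_i'x - b_i)^2 / ||a_i||^2 is (resid/norm_sq)*a.
        x_next = [xj - stepsize * resid / norm_sq * aj + momentum * (xj - pj)
                  for aj, xj, pj in zip(a, x, x_prev)]
        x_prev, x = x, x_next
    return x

# Hypothetical consistent system with known solution x_true.
rows = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
rhs = [sum(a * t for a, t in zip(row, x_true)) for row in rows]

x_shb = stochastic_heavy_ball(rows, rhs)
```

Setting `momentum=0.0` recovers plain SGD with a fixed stepsize on the same losses, which makes the effect of the momentum term easy to isolate.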

arXiv.org e-Print Archive

Semi-Stochastic Gradient Descent Methods

Authors: Konečný Jakub; Richtárik Peter
Publication date: 16/06/2015

In this paper we study the problem of minimizing the average of a large number ($n$) of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs in each of which a single full gradient and a random number of stochastic gradients is computed, following a geometric law. The total work needed for the method to output an $\varepsilon$-accurate solution in expectation, measured in the number of passes over data, or equivalently, in units equivalent to the computation of a single gradient of the loss, is $O((\kappa/n)\log(1/\varepsilon))$, where $\kappa$ is the condition number. This is achieved by running the method for $O(\log(1/\varepsilon))$ epochs, with a single gradient evaluation and $O(\kappa)$ stochastic gradient evaluations in each. The SVRG method of Johnson and Zhang arises as a special case. If our method is limited to a single epoch only, it needs to evaluate at most $O((\kappa/\varepsilon)\log(1/\varepsilon))$ stochastic gradients. In contrast, SVRG requires $O(\kappa/\varepsilon^2)$ stochastic gradients. To illustrate our theoretical results, S2GD only needs the workload equivalent to about 2.1 full gradient evaluations to find a $10^{-6}$-accurate solution for a problem with $n=10^9$ and $\kappa=10^3$.

Comment: 19 pages, 3 figures, 2 algorithms, 3 tables
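
The epoch structure described in the abstract (one full gradient, then a random number of variance-corrected stochastic steps) can be sketched as below. This is a sketch under assumptions rather than a reference implementation: the precise form of the geometric law used for the inner-loop length, the parameter names (`max_inner`, `nu`) and the small least-squares problem are all hypothetical.

```python
import random

def s2gd(grad_i, n, x0, stepsize, max_inner, nu, epochs, seed=0):
    """Semi-stochastic gradient descent sketch.

    grad_i(i, x) returns the gradient of the i-th loss at x as a list.
    """
    rng = random.Random(seed)
    dim = len(x0)
    x = list(x0)
    lengths = list(range(1, max_inner + 1))
    # Assumed geometric law: P(t) proportional to (1 - nu*stepsize)^(max_inner - t).
    weights = [(1.0 - nu * stepsize) ** (max_inner - t) for t in lengths]
    for _ in range(epochs):
        # One full gradient per epoch, computed at the reference point x.
        full = [0.0] * dim
        for i in range(n):
            g = grad_i(i, x)
            full = [fk + gk / n for fk, gk in zip(full, g)]
        t_j = rng.choices(lengths, weights=weights)[0]
        y = list(x)
        for _ in range(t_j):
            i = rng.randrange(n)
            gy, gx = grad_i(i, y), grad_i(i, x)
            # Stochastic gradient corrected by the stored full gradient.
            y = [yk - stepsize * (fk + gyk - gxk)
                 for yk, fk, gyk, gxk in zip(y, full, gy, gx)]
        x = y
    return x

# Hypothetical least-squares losses f_i(x) = 0.5 * (a_i'x - b_i)^2.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
targets = [1.0, 2.0, 3.5, -0.5]

def least_squares_grad(i, x):
    resid = sum(a * xk for a, xk in zip(rows[i], x)) - targets[i]
    return [resid * a for a in rows[i]]

x_s2gd = s2gd(least_squares_grad, n=4, x0=[0.0, 0.0], stepsize=0.1,
              max_inner=50, nu=0.75, epochs=60)
```

For this data the normal equations give the minimizer (4/3, 2), even though the system is not consistent; the correction term is what lets a fixed stepsize reach it without the residual noise of plain SGD.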

arXiv.org e-Print Archive

Accelerated Gossip via Stochastic Heavy Ball Method

Authors: Loizou Nicolas; Richtárik Peter
Publication date: 23/09/2018

In this paper we show how the stochastic heavy ball method (SHB), a popular method for solving stochastic convex and non-convex optimization problems, operates as a randomized gossip algorithm. In particular, we focus on two special cases of SHB: the Randomized Kaczmarz method with momentum and its block variant. Building upon a recent framework for the design and analysis of randomized gossip algorithms [Loizou and Richtárik, 2016], we interpret the distributed nature of the proposed methods. We present novel protocols for solving the average consensus problem where in each step all nodes of the network update their values but only a subset of them exchange their private values. Numerical experiments on popular wireless sensor networks showing the benefits of our protocols are also presented.

Comment: 8 pages, 5 figures, 56th Annual Allerton Conference on Communication, Control, and Computing, 201
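
A minimal sketch of a pairwise gossip step with momentum, assuming the consensus problem is written as the linear system x_i = x_j over the edges of the network and a Kaczmarz step with momentum is applied to one randomly sampled edge. The ring topology, initial values, stepsize and momentum value are hypothetical, and the block variant is not shown.

```python
import random

def gossip_with_momentum(values, edges, stepsize=1.0, momentum=0.3,
                         iters=5000, seed=0):
    """Average consensus via randomized pairwise gossip with heavy ball momentum."""
    rng = random.Random(seed)
    x_prev = list(values)
    x = list(values)
    for _ in range(iters):
        i, j = rng.choice(edges)
        # Every node applies the momentum term using only its own history...
        x_next = [xk + momentum * (xk - pk) for xk, pk in zip(x, x_prev)]
        # ...but only the sampled pair exchanges values.
        shift = stepsize * (x[i] - x[j]) / 2.0
        x_next[i] -= shift
        x_next[j] += shift
        x_prev, x = x, x_next
    return x

# Hypothetical network: six nodes on a ring, with private initial values.
initial = [3.0, -1.0, 4.0, 1.0, 5.0, 0.0]
ring = [(k, (k + 1) % 6) for k in range(6)]
consensus = gossip_with_momentum(initial, ring)
average = sum(initial) / len(initial)
```

With `stepsize=1.0` and `momentum=0.0` the sampled pair simply replaces both values by their mean, which is the classical pairwise gossip protocol; each step preserves the sum of the node values, so the network can only agree on the true average.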

arXiv.org e-Print Archive

Coordinate Descent Face-Off: Primal or Dual?

Authors: Csiba Dominik; Richtárik Peter
Publication date: 29/05/2016

Randomized coordinate descent (RCD) methods are state-of-the-art algorithms for training linear predictors via minimizing regularized empirical risk. When the number of examples ($n$) is much larger than the number of features ($d$), a common strategy is to apply RCD to the dual problem. On the other hand, when the number of features is much larger than the number of examples, it makes sense to apply RCD directly to the primal problem. In this paper we provide the first joint study of these two approaches when applied to L2-regularized ERM. First, we show through a rigorous analysis that for dense data, the above intuition is precisely correct. However, we find that for sparse and structured data, primal RCD can significantly outperform dual RCD even if $d \ll n$, and vice versa, dual RCD can be much faster than primal RCD even if $n \ll d$. Moreover, we show that, surprisingly, a single sampling strategy minimizes both the (bound on the) number of iterations and the overall expected complexity of RCD. Note that the latter complexity measure also takes into account the average cost of the iterations, which depends on the structure and sparsity of the data, and on the sampling strategy employed. We confirm our theoretical predictions using extensive experiments with both synthetic and real data sets.

arXiv.org e-Print Archive

Nonconvex Variance Reduced Optimization with Arbitrary Sampling

Authors: Horváth Samuel; Richtárik Peter
Publication date: 31/01/2019

We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon current mini-batch analysis of these methods by proposing importance sampling for minibatches in this setting. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., a fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting.

Comment: 9 pages, 12 figures, 25 pages of supplementary material
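
The basic mechanism behind importance sampling can be sketched in a few lines. This shows only the unbiasedness of a rescaled single-example gradient, not the variance reduced methods analyzed in the paper; the per-example gradients, the constants in `smoothness` and the choice of probabilities proportional to them are hypothetical.

```python
def importance_sampled_gradients(grads, probs):
    """Return the rescaled estimator g_i / (n * p_i) for every index i."""
    n = len(grads)
    return [[gk / (n * p) for gk in g] for g, p in zip(grads, probs)]

# Hypothetical per-example gradients at some fixed point.
grads = [[4.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
# Hypothetical smoothness constants; examples are sampled proportionally to them.
smoothness = [4.0, 2.0, 1.0]
probs = [value / sum(smoothness) for value in smoothness]

estimators = importance_sampled_gradients(grads, probs)

# Expectation of the estimator over the sampling distribution.
expected = [sum(p * e[k] for p, e in zip(probs, estimators)) for k in range(2)]
# Full gradient of the average loss.
full_gradient = [sum(g[k] for g in grads) / len(grads) for k in range(2)]
```

The expectation matches the full gradient for any strictly positive probabilities; the choice of probabilities only affects the variance of the estimator.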

arXiv.org e-Print Archive

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

Authors: Hanzely Filip; Richtárik Peter
Publication date: 15/01/2020

We propose a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both. In special cases, our method reduces to several known and previously thought to be unrelated methods, such as SAGA, LSVRG, JacSketch, SEGA and ISEGA, and their arbitrary sampling and proximal generalizations. However, we also highlight a large number of new specific algorithms with interesting properties. We provide a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions. With this theorem we recover best-known and sometimes improved rates for known methods arising in special cases. As a by-product, we provide the first unified method and theory for stochastic gradient and stochastic coordinate descent type methods.

Comment: 61 pages, 6 figures, 3 tables

arXiv.org e-Print Archive

Stochastic Reformulations of Linear Systems: Algorithms and Convergence Theory

Authors: Richtárik Peter; Takáč Martin
Publication date: 24/01/2020

We develop a family of reformulations of an arbitrary consistent linear system into a stochastic problem. The reformulations are governed by two user-defined parameters: a positive definite matrix defining a norm, and an arbitrary discrete or continuous distribution over random matrices. Our reformulation has several equivalent interpretations, allowing for researchers from various communities to leverage their domain specific insights. In particular, our reformulation can be equivalently seen as a stochastic optimization problem, stochastic linear system, stochastic fixed point problem and a probabilistic intersection problem. We prove sufficient, and necessary and sufficient conditions for the reformulation to be exact. Further, we propose and analyze three stochastic algorithms for solving the reformulated problem (basic, parallel and accelerated methods) with global linear convergence rates. The rates can be interpreted as condition numbers of a matrix which depends on the system matrix and on the reformulation parameters. This gives rise to a new phenomenon which we call stochastic preconditioning, and which refers to the problem of finding parameters (matrix and distribution) leading to a sufficiently small condition number. Our basic method can be equivalently interpreted as stochastic gradient descent, stochastic Newton method, stochastic proximal point method, stochastic fixed point method, and stochastic projection method, with fixed stepsize (relaxation parameter), applied to the reformulations.

Comment: Accepted to SIAM Journal on Matrix Analysis and Applications. This arXiv version has an additional section (Section 6.2), listing several extensions done since the paper was first written. Statistics: 39 pages, 4 reformulations, 3 algorithms

arXiv.org e-Print Archive

Randomized Quasi-Newton Updates are Linearly Convergent Matrix Inversion Algorithms

Authors: Gower Robert M.; Richtárik Peter
Publication date: 23/03/2016

We develop and analyze a broad family of stochastic/randomized algorithms for inverting a matrix. We also develop specialized variants maintaining symmetry or positive definiteness of the iterates. All methods in the family converge globally and linearly (i.e., the error decays exponentially), with explicit rates. In special cases, we obtain stochastic block variants of several quasi-Newton updates, including bad Broyden (BB), good Broyden (GB), Powell-symmetric-Broyden (PSB), Davidon-Fletcher-Powell (DFP) and Broyden-Fletcher-Goldfarb-Shanno (BFGS). Ours are the first stochastic versions of these updates shown to converge to an inverse of a fixed matrix. Through a dual viewpoint we uncover a fundamental link between quasi-Newton updates and approximate inverse preconditioning. Further, we develop an adaptive variant of randomized block BFGS, where we modify the distribution underlying the stochasticity of the method throughout the iterative process to achieve faster convergence. By inverting several matrices from varied applications, we demonstrate that AdaRBFGS is highly competitive when compared to the well established Newton-Schulz and minimal residual methods. In particular, on large-scale problems our method outperforms the standard methods by orders of magnitude. Development of efficient methods for estimating the inverse of very large matrices is a much needed tool for preconditioning and variable metric optimization methods in the advent of the big data era.

Comment: 42 pages, 6 figures, 2 tables

arXiv.org e-Print Archive